Skip to content

Conversation

hknlof
Copy link

@hknlof hknlof commented Aug 1, 2025

#13323

Which issue does this PR close?

Rationale for this change

DF.write_parquet writes multiple files / one directory even if options.single_file_output is set.

What changes are included in this PR?

Introduce an internal .single extension.

Are these changes tested?

Yes, tests are part of this PR.

Are there any user-facing changes?

Not in this implementation. There might be, if we decide to move to an FileSinkConfig based solution.

Quoting: #13323 (comment)

It seems hard to control the behavior of write_parquet by single_file_output(and I've noticed that It's never used), what really controls whether to generate a single file output is determining the suffix(in start_demuxer_task()), there are several methods I can think of to handle this issue:

  1. We can add a suffix like .single to the paths that require generating a single file, and then recognize this suffix in start_demuxer_task().
  2. Give up single_file_output in DataFrameWriteOptions, use FileSinkConfig instead to control single file behavior.

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels Aug 1, 2025
@hknlof hknlof force-pushed the fix/write-single-parquet branch from 1c05032 to f1d58c5 Compare August 4, 2025 14:32
@alamb
Copy link
Contributor

alamb commented Aug 7, 2025

I restarted the checks

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Aug 20, 2025
@alamb
Copy link
Contributor

alamb commented Aug 23, 2025

Sorry for the delay -- this PR seems to have quite. few conflicts.

@hknlof hknlof force-pushed the fix/write-single-parquet branch from 00148b2 to 651cfb8 Compare August 25, 2025 08:15
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules labels Aug 25, 2025
@github-actions github-actions bot removed sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Aug 25, 2025
@hknlof
Copy link
Author

hknlof commented Aug 25, 2025

Sorry for the delay -- this PR seems to have quite. few conflicts.

No worries! My bad, I forgot to sync before rebasing the branch onto main again. Thx!

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way of implicitly encoding single fileness via adding an extension to the path does seem quite hacky; not sure if this is the correct approach to follow 🤔

Comment on lines +86 to +94
let path = if file_type.get_ext() != DEFAULT_PARQUET_EXTENSION
&& options.single_file_output
{
let mut path = path.to_owned();
path.push_str(SINGLE_FILE_EXTENSION);
path
} else {
path.to_owned()
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part confuses me; If I'm not wrong file_type.get_ext() != DEFAULT_PARQUET_EXTENSION will always be true, because:

impl GetExt for ParquetFormatFactory {
fn get_ext(&self) -> String {
// Removes the dot, i.e. ".parquet" -> "parquet"
DEFAULT_PARQUET_EXTENSION[1..].to_string()
}
}

Comment on lines +250 to +254
let test_path = std::path::Path::new(&path);
assert!(
test_path.is_dir(),
"No extension and default DataFrameWriteOptons should have yielded a dir."
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would need to also check there are indeed multiple parquet files

Comment on lines +288 to +292
let test_path = std::path::Path::new(&path);
assert!(
test_path.is_file(),
"No extension and DataFrameWriteOptons::with_single_file_output(true) should have yielded a single file."
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here we would need to check that only one file is written, not just a file is written

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: DataFrameWriteOptions::with_single_file_output produces a directory
3 participants